Project Goals


The  Problem 


Data 


We believe insider threat detection methods can be improved by
monitoring  and  analyzing  features  of  user  behavior  not  typically  associated 
with  indicators  of  malicious  insider  behavior.  Anomalous  behaviors  and 
statistical  outliers  observed  in  such  data  sets  may  identify  new  indicators 
or  help  reduce  high  false  positive  rates  associated  with  existing  indicators. 

We  have  three  specific  goals  for  this  project: 

1)  Detect  account  masquerading  by  monitoring  for  sudden  and 
significant  changes  in  IT  system  interactions  by  the  account  user. 

2)  Develop  unique  profiles  of  individual  users  based  on  behavior 
on  IT  systems. 

3)  Baseline  individual  user  behavior  and  monitor  for  changes  that 
indicate  potentially  malicious  insider  behavior  is  likely  to  occur. 

We  intend  to  deliver  the  following  outcomes: 

1) A measure of confidence that the person currently interacting with
the  IT  system  is  or  is  not  the  authorized  user. 

2)  Methods  for  collecting  additional  context  of  user  behavior  by  which 
insider  threat  and  anomaly  detection  engines  can  determine  with 
higher  confidence  that  suspicious  behavior  is  likely  to  be  malicious. 

3)  Visualization  of  these  methods  and  metrics  for  analyst  use. 


We're not looking for a needle in a STACK of hay...
we're looking for a NEEDLE in a STACK OF NEEDLES


Our  Approach 

Researchers  have  proposed  numerous  methods  for  detecting  anomalous 
user  behavior.  We  intend  to  focus  on  three  methods  in  particular: 

1)  Classifier-Adjusted  Density  Estimation. This  has  been  shown  to  be 
effective  for  anomaly  detection  in  data  with  high  dimensionality 
[Friedland  2014]. 

2)  Latent  Dirichlet  Allocation  (LDA).  Robinson  demonstrates  a  technique 
using  the  LDA  model,  borrowed  from  natural  language  processing,  to 
identify  malicious  exfiltration  events  in  a  large  data  set  of  network 
header  information  [Robinson  2010]. 


3) Multivariate Statistical Analysis of linguistic characteristics of user text.
Greitzer  and  Ferryman  [Greitzer  2013]  and  Brown  et  al.  [Brown  2013] 
demonstrate  that  statistical  analysis  using  a  variant  of  Chebyshev's 
inequality  can  identify  outliers  in  a  population  of  linguistic 
characteristics  that  correlate  to  persons  with  known  psychological  risk 
indicators. 
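
The outlier test in method 3 can be sketched as follows. This is a minimal illustration that uses the plain form of Chebyshev's inequality (at most 1/k^2 of any distribution lies k or more standard deviations from the mean), not the specific variant used in the cited work, which is not described here. The function names and the sample rates are hypothetical.

```python
from statistics import mean, pstdev


def chebyshev_outliers(values, k=2.0):
    """Return (index, value) pairs at least k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) >= k * sigma]


def chebyshev_bound(k):
    """Upper bound on the fraction of values at least k sigma from the mean."""
    return 1.0 / (k * k)


# Hypothetical per-user rates (percent of words) for one linguistic category.
rates = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.6, 0.5, 0.7, 4.9]

print(chebyshev_outliers(rates, k=2.0))  # users flagged for review
print(chebyshev_bound(2.0))              # distribution-free bound for k = 2
```

Because the bound holds for any distribution, it is conservative. With small populations the choice of k matters: in a sample of n values no point can sit more than sqrt(n - 1) population standard deviations from the mean.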


Host  Audit:  application  use,  removable  media,  file  activity,  keystrokes, 
registry  entries,  email  activity,  etc. 

Network  Activity:  login  events 

Linguistic  and  Structural:  word  count,  text  structure,  frequency  of  words 
by  category,  etc. 


Linguistic  Patterns 

Characteristics  of  a  user's  speech  or  writing  can  be  measured  both  structurally  and 
linguistically.  Using  these  metrics,  researchers  have  shown  the  feasibility  of  identifying 
anonymous  authors  [Narayanan  2012].  Others  have  observed  measurable  changes  in  linguistic 
patterns  of  known  insiders  [Taylor  2013]. 

Tools 

•  Email text and spoken words are extracted from source repositories and processed to
isolate individual users. Identifying information is masked and actual text is not viewed by
researchers. Raw data is parsed and prepared for analysis using a custom application
written by research staff.

•  Linguistic analysis is performed by the Linguistic Inquiry and Word Count (LIWC) tool.

•  Structural characteristics are obtained with a custom application.
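
As a rough illustration of the kind of features these tools produce, the sketch below computes two structural measures (word count, words per sentence) and the percentage of words falling in a few categories. It is not LIWC and not the custom application described above; the category lexicon, function names, and sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical category lexicon. A real tool ships far larger dictionaries.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "first_person_plural": {"we", "us", "our", "ours", "ourselves"},
    "money": {"cash", "pay", "paid", "budget", "cost"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text):
    """Return structural features and per-category word percentages."""
    words = tokenize(text)
    total = len(words)
    counts = Counter()
    for word in words:
        for category, lexicon in CATEGORIES.items():
            if word in lexicon:
                counts[category] += 1
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    features = {
        "word_count": total,
        "words_per_sentence": total / len(sentences) if sentences else 0.0,
    }
    for category in CATEGORIES:
        features[category] = 100.0 * counts[category] / total if total else 0.0
    return features


sample = "We paid our share. I think my budget is fine."
print(profile(sample))
```

Expressing category counts as a percentage of total words keeps profiles comparable across texts of different lengths.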

Initial  Results 


Comparing  data  sets:  are  linguistic  features  of  the  same  user  similar  for 
typed  vs.  spoken  words? 


Category                          Email avg   Spoken avg
Pronoun                           7.97%       11.87%
1st Person Singular               2.58%       3.90%
1st Person Plural                 0.95%       1.34%
2nd Person                        1.51%       2.33%
Verbs                             11.30%      15.40%
Swear Words                       0.00%       0.01%
Achievement                       1.31%       1.85%
Leisure                           0.70%       0.58%
Home                              0.20%       0.29%
Money                             0.52%       0.74%

Category                          Email avg   Spoken avg
Social Words: All                 6.78%       9.75%
Social Words: Family              0.07%       0.06%
Social Words: Friends             0.35%       0.05%
Social Words: Humans              0.16%       0.22%
Affect: All                       3.04%       5.25%
Affect: Positive Emotions         2.39%       4.23%
Affect: Negative Emotions         0.63%       0.99%
Affect: Neg. Emotions: Anxiety    0.08%       0.14%
Affect: Neg. Emotions: Anger      0.21%       0.29%
Affect: Neg. Emotions: Sadness    0.07%       0.10%

Sample  of  comparison  of  email  &  webcast  analysis  from  a  user  in  Population  L,. 
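
One possible way to quantify how similar the typed and spoken profiles are is sketched below, using the email and spoken averages reported above for the first ten categories (Pronoun through Money). Cosine similarity and the per-category ratio are choices made for this illustration, not measures taken from the study.

```python
import math

FEATURES = [
    "Pronoun", "1st Person Singular", "1st Person Plural", "2nd Person",
    "Verbs", "Swear Words", "Achievement", "Leisure", "Home", "Money",
]
# Percent of words per category, copied from the reported averages.
EMAIL = [7.97, 2.58, 0.95, 1.51, 11.30, 0.00, 1.31, 0.70, 0.20, 0.52]
SPOKEN = [11.87, 3.90, 1.34, 2.33, 15.40, 0.01, 1.85, 0.58, 0.29, 0.74]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def spoken_to_email_ratios(email, spoken):
    """Per-category ratio of spoken to email rate; None where email is zero."""
    return [s / e if e else None for e, s in zip(email, spoken)]


print(round(cosine_similarity(EMAIL, SPOKEN), 4))
for name, ratio in zip(FEATURES, spoken_to_email_ratios(EMAIL, SPOKEN)):
    print(name, "n/a" if ratio is None else round(ratio, 2))
```

Cosine similarity is dominated by the high-frequency categories (pronouns and verbs here), so the per-category ratios are a useful complement when rare categories are the ones of interest.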


User  Profiles 

Can  features  unique  to  individual  users  be  identified? 


Logarithmic  radar  chart  comparing  linguistic  features  indicating  neuroticism 
[Friedland  2013] 


Heat  map  showing  identification  of  anomalous  network  events 
[Robinson  2010] 


Network  Authentication  Graphs 

These  directed  graphs  represent  a  user's  authentication  activity  between  networked  computers  over  a  predefined 
period.  Research  shows  that  administrative  users  generally  have  larger,  more  complex  graphs  than  normal  users  [Kent 
2013].  Furthermore,  it  is  possible  to  profile  each  user's  authentication  activity,  resulting  in  the  ability  to  detect  abnormal 
and  potentially  malicious  activity.  Empirical  research  on  malicious  insiders  shows  that  many  insiders  engage  in 
reconnaissance  and  information-gathering  activities,  accessing  numerous  network  locations  that  often  differ  from  the 
insider's  normal  work  activity. 

The figure to the left is a network authentication graph from a typical user with
administrative access. This user accessed 18 computers with 41 authentication arcs.
This is a more complex authentication graph than those of general users [Kent 2013].
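
A minimal sketch of how such a graph might be built and compared against a user's baseline is shown below. It assumes authentication events are available as (source host, destination host) pairs; the host names, event lists, and function names are hypothetical.

```python
def build_graph(events):
    """Build a directed graph from (source_host, destination_host) pairs.

    Returns the set of hosts (nodes) and the set of distinct arcs.
    """
    arcs = set(events)
    nodes = {host for arc in arcs for host in arc}
    return nodes, arcs


def new_arcs(baseline_events, current_events):
    """Arcs seen in the current period that never appear in the baseline."""
    _, baseline = build_graph(baseline_events)
    _, current = build_graph(current_events)
    return sorted(current - baseline)


# Hypothetical authentication events for one user.
baseline = [("ws1", "fs1"), ("ws1", "mail"), ("ws1", "fs1")]
current = [("ws1", "fs1"), ("ws1", "db7"), ("db7", "hr2")]

nodes, arcs = build_graph(baseline)
print(len(nodes), "computers,", len(arcs), "authentication arcs")
print("previously unseen arcs:", new_arcs(baseline, current))
```

Node and arc counts give the graph-size comparison described above, while the set difference surfaces access to network locations outside the user's normal activity.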


Keystroke/Mouse  Biometrics 

Using  metrics  like  keystroke  latency  or  mouse  dynamics,  researchers  have  shown  how  individual 
users can be identified [Shen 2013]. Furthermore, evidence suggests that changes in a user's
personal state, such as increased stress, are also detectable.
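
As a simple illustration of one such metric, the sketch below derives inter-key latencies from key-press timestamps and scores a session against a user's baseline. The z-score comparison is an assumption made for this example, not the method of the cited work, and the timestamps are invented.

```python
from statistics import mean, stdev


def latencies(timestamps_ms):
    """Time between consecutive key presses, in milliseconds."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def latency_z_score(baseline_ms, session_ms):
    """How many baseline standard deviations the session mean latency is away."""
    base = latencies(baseline_ms)
    sess = latencies(session_ms)
    sigma = stdev(base)
    if sigma == 0:
        return 0.0  # baseline has no variation to scale by
    return (mean(sess) - mean(base)) / sigma


# Hypothetical key-press timestamps (ms) for the same account.
baseline_presses = [0, 110, 230, 335, 450, 570, 680]
session_presses = [0, 190, 370, 560, 760]

print(latencies(baseline_presses))
print(latency_z_score(baseline_presses, session_presses))
```

A large score suggests either a different person at the keyboard or a change in the user's state. Telling those two cases apart would need additional features.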


The figure to the left shows cross-validated error rates for sparse models selected using
LASSO regularized logistic regression. The grey line shows the misclassification rate for
the null model. The vertical bars give 1 standard deviation confidence intervals. The red
points are recipients whose model had better than default error, but not significantly. The
green points are the three recipients whose models were at least 1 standard deviation better
than the null model.
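
The colour coding described in this caption can be sketched as a comparison between a model's cross-validated error and the null model's error. The sketch assumes the interval is the standard deviation of the per-fold error rates; the figure may instead use a standard error, and the fold values below are invented. The model fitting itself (LASSO regularized logistic regression) is not shown.

```python
from statistics import mean, stdev


def compare_to_null(fold_errors, null_error):
    """Classify a model by its cross-validated error relative to the null model.

    "significant":             mean error is at least 1 standard deviation
                               below the null error (green points)
    "better, not significant": mean error is below the null error, but by
                               less than 1 standard deviation (red points)
    "no better":               mean error is not below the null error
    """
    avg = mean(fold_errors)
    spread = stdev(fold_errors)
    if avg + spread <= null_error:
        return "significant"
    if avg < null_error:
        return "better, not significant"
    return "no better"


# Hypothetical per-fold misclassification rates for three recipients.
print(compare_to_null([0.10, 0.12, 0.11, 0.09, 0.13], null_error=0.20))
print(compare_to_null([0.18, 0.22, 0.17, 0.21, 0.19], null_error=0.20))
print(compare_to_null([0.25, 0.22, 0.24], null_error=0.20))
```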


Visualization 


Issues  of  concern  must  be  visible  and  apparent  to  analysts.  In  cooperation  with  the  Cyber  Security  Centre 
at  Oxford,  we  will  leverage  an  existing  insider  threat  visualization  toolkit  to  represent  the  data  and 
anomalous  activity  we  find  in  a  clear  and  actionable  manner. 